Search Engine Technology and Digital Libraries: Moving from Theory to Practice

Authors

  • Friedrich Summann
  • Norbert Lossau
Abstract

This article describes the journey from the conception of and vision for a modern search-engine-based search environment to its technological realisation. In doing so, it takes up the thread of an earlier article on this subject, this time from a technical viewpoint. As well as presenting the conceptual considerations of the initial stages, this article will principally elucidate the technological aspects of this journey.

The conception of an academic search engine

The starting point for the deliberations about the development of an academic search engine was the experience we gained through the generally successful project "Digital Library NRW", in which from 1998 to 2000—with Bielefeld University Library in overall charge—we designed a system model for an Internet-based library portal with an improved academic search environment at its core. At the heart of this system was a metasearch with an availability function, to which we added a user interface integrating all relevant source material for study and research. The deficiencies of this approach became apparent soon after the system was launched in June 2001. There were problems with the stability and performance of the database retrieval system, with the integration of full-text documents and Internet pages, and with acceptance by users, who increasingly perform searches themselves using search engines rather than asking the library for help. Since commercial search engines also pose a long list of problems for academic use (in particular the retrieval of academic information and long-term availability), the idea was born of a search engine configured specifically for academic use. We also hoped that, with a single access point founded on improved search engine technology, we could access the heterogeneous academic resources of subject-based bibliographic databases, catalogues, electronic journals, document servers and academic web pages.
Software evaluation and technical realisation

Following on from our fundamental deliberations about an academic search engine, we searched the market for suitable software products. Our discussions with Google in 2002 broke down at an early stage, as we were only able to speak to sales personnel, and at that time at least we received no indication that we could install Google software locally for testing. The situation was different with the search engine Convera: we were able to install their search engine on a machine in Bielefeld and test it for a limited period. After two weeks of intensive observation we concluded that the Convera software was more appropriate for an intranet installation than for our intended use as an Internet search engine. We also tested the Russian open-source search engine mnoGoSearch and found many positive aspects to that software, but in our tests we encountered performance problems when processing large amounts of data. Finally, we contacted the Norwegian software company Fast, which in 2002 was one of the market leaders alongside Google with its search engine AlltheWeb. A test installation was quickly and flexibly agreed, and its technical realisation also succeeded smoothly and without problems. Our experience with the Fast software was so positive that by the end of the test period it was clear that we should carry out a proof of concept with this search engine. Within the proof of concept we aimed to bring together and make available a representative and heterogeneous body of academic online material. Various document types and formats (both full text and metadata), and contents of both the visible and the invisible web, were to be included.
As a basic condition, we emphasised that the test should be carried out under live conditions on the basis of FAST Data Search. In addition, interoperability standards (OAI, XML) should be followed and prototypes of an intelligent and flexible user interface developed. The technical work in Bielefeld University Library began in earnest in the summer of 2003. The technical core team has consisted of two software developers, who have been creating prototypes based on the FAST Data Search software. A concrete start was made with the realisation of a "Math Demonstrator", which—through its concentration on a single subject area—should form the basis for further development and discussion. In spring 2004 this subject-based approach was extended with the aim of creating a general "Digital Collections Demonstrator". Both demonstrators had their public launch in June 2004 with the establishment of the Bielefeld Academic Search Engine (BASE, http://base.ub.uni-bielefeld.de/). The work in Bielefeld has also formed the basis for a collaborative project proposal, "Search engine technology", submitted to the Deutsche Forschungsgemeinschaft by Bielefeld University Library and the Regional Service Centre for Academic Libraries in North Rhine-Westphalia (HBZ, Cologne) as part of the Distributed Document Server (VDS) initiative [1]. This joint project is one way of extending the activities in Bielefeld and is expected to kick off at the end of 2004.

Technical details of the search engine solution

The technical structure of the FAST search engine is modular and transparent in construction and includes the standalone system components of a back-end and a front-end server. At the moment, Bielefeld is running one front-end server, but it could easily run more. Likewise the number of back-end servers is scalable, and both areas can be rebuilt into a multi-node system without any problems.
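The OAI interoperability standard mentioned above defines a simple HTTP/XML protocol for harvesting metadata from document servers. As a rough illustration of how such a harvest works (the repository URL is hypothetical and this is not the BASE harvester itself), a minimal sketch in Python:

```python
# Sketch of an OAI-PMH metadata harvest. The endpoint URL is hypothetical;
# BASE's actual harvesting code is not shown in the article.
import urllib.parse
import xml.etree.ElementTree as ET

# Dublin Core title element, namespace-qualified as ElementTree expects.
DC_TITLE = "{http://purl.org/dc/elements/1.1/}title"

def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    query = urllib.parse.urlencode({"verb": "ListRecords",
                                    "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"

def extract_titles(oai_xml):
    """Extract Dublin Core titles from a ListRecords response document."""
    root = ET.fromstring(oai_xml)
    return [el.text for el in root.iter(DC_TITLE)]
```

Fetching `list_records_url("http://example.org/oai")` with any HTTP client and passing the response body to `extract_titles` yields the record titles; a real harvester would also follow the protocol's resumption tokens for large result sets.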
The front-end handles the tasks of providing the search environment, results analysis and presentation, and at the moment it is running on a two-processor Linux PC under SuSE 9.0. Connection to the web is established using PHP 4 on an Apache web server. The back-end deals with data loading, pre-processing and data conversion, data assimilation, crawling, document processing and indexing. The user interface is bilingual (German and English). Alongside the basic search form with its Google-like single-line search box there is an advanced search option (see Figure 1) with additional functionality, which is the focus of the software development. It is here that supplementary functions, such as refined and restricted searching, choice of collections and search history, are offered. Both search forms allow the search to be restricted to documents that are freely available.

Figure 1: Advanced search screen

The results page (Figure 2) differs from the search engine standard in that it shows a sophisticated display of metadata whenever metadata is present in the document. Next to the displayed search results there is an interactive area where the user can refine the search (for example, by using the metadata to search by author and classification, or by formal aspects such as document format and collection). The appropriate field values from all the results are assembled into drop-down menus. It is also possible to continue from the results by searching for similar documents in the master index (find similar), within the search results (refine similar), or by excluding similar documents from the search results (exclude similar). A search history option completes the current capabilities of the user interface.
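The refinement mechanism described above amounts to faceting: collecting the distinct metadata values present in a result set to fill the drop-down menus, then filtering on the value the user picks. A minimal sketch of that idea (the record layout is illustrative; the actual FAST index schema is not described here):

```python
# Sketch of metadata-based result refinement. Records are plain dicts here;
# the real system reads these fields from the FAST Data Search index.

def facet_values(results, field):
    """Distinct values of one metadata field across the result set,
    sorted for display in a drop-down menu."""
    return sorted({r[field] for r in results if field in r})

def refine(results, field, value):
    """Keep only the results whose metadata field matches the chosen value."""
    return [r for r in results if r.get(field) == value]
```

For a result set with authors "Smith" and "Jones", `facet_values(results, "author")` would populate the author drop-down, and `refine(results, "format", "pdf")` would narrow the hits to PDF documents.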
Figure 2: Results page

The back-end system is a live system running on a two-processor Linux PC under SuSE Linux 9.0 with a RAID system offering 290 GB of hard disk capacity. In parallel, a test system runs on a second Linux PC, allowing internal alterations to be made for development purposes without affecting the live system.

The contents of the collection

As of June 2004, approximately 600,000 documents had been captured and distributed across 15 collections; the back-end server requires about 25 GB of free space for them. The captured sources were chosen with the aim of covering representative data of different types. In doing so, the following were processed:


Similar resources

Search Engine Technology and Digital Libraries: Libraries Need to Discover the Academic Internet

This article is the revised and elaborated version of a presentation that was delivered at the invitation of the American Digital Library Federation (DLF) at their Spring Forum meeting in New Orleans (http://www.diglib.org/forums/Spring2004/springforum04abs.htm). It will be followed by "Search engine technology and digital libraries: Moving from theory to praxis" as a collaborative article from...


Critical Success Factors of Digital Libraries in Iran: A Qualitative Research

Background and Aim: Myriad of IT projects failed in recent years. Digital libraries (DLs) as the product of the usage of IT in the library organization followed a similar trend. This paper studies the critical success factors (CSFs) of DLs in the context of Iran, with special focus on the Iranian Ministry of Science, Research, and Technology. CSFs, in this paper, are those factors that if follo...


What Goals Do Managers Pursue in Creating Digital Libraries?

The aim of this study is to identify the goals of governmental organizations in developing digital libraries. The used methodology is Grounded Theory. This study used three step systematic method of the mentioned research method. Interview with 11 digital library managers help us to collect data. The collected data filtered by coding steps and demonstrated by Paradigm model. Results showed that...


The MINERVA1 Project: Towards Collaborative Search in Digital Libraries Using Peer-to-Peer Technology

We consider the problem of collaborative search across a large number of digital libraries and query routing strategies in a peer-to-peer (P2P) environment. Both digital libraries and users are equally viewed as peers and, thus, as part of the P2P network. Our system provides a versatile platform for a scalable search engine combining local index structures of autonomous peers with a global dir...


Using Interactive Search Elements in Digital Libraries

Background and Aim: Interaction in a digital library help users locating and accessing information and also assist them in creating knowledge, better perception, problem solving and recognition of dimension of resources. This paper tries to identify and introduce the components and elements that are used in interaction between user and system in search and retrieval of information in digital li...


Future-oriented implications of the resilience theory for Iran public libraries

Target: In order to play their role in social developments, public libraries face technological changes and unknown issues that can affect their identity and mission .In reference to the application of novel approaches to reconceptualize the mission of public libraries, this study tries to employ resilience theory to craft a vision for the future of Iran public libraries. Method: This study u...



Journal:
  • D-Lib Magazine

Volume 10, Issue

Pages -

Publication date: 2004